
Conversation

@wecharyu commented Dec 12, 2025

Rationale for this change

Limit the row group size in bytes

What changes are included in this PR?

Add a new config parquet::WriterProperties::max_row_group_bytes.

Are these changes tested?

Yes, a unit test is added.

Are there any user-facing changes?

Yes, users can use the new config to limit the row group size in bytes.
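
For illustration, a minimal sketch of how the new limit would be configured next to the existing row-count limit. The max_row_group_bytes() builder method is assumed from this PR and is not part of a released Arrow version:

#include <memory>

#include "parquet/properties.h"

// Sketch only: set both the existing limit in rows and the new limit in bytes
// proposed by this PR (builder method name assumed, default values shown).
std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  return parquet::WriterProperties::Builder()
      .max_row_group_length(1024 * 1024)        // existing limit in rows
      ->max_row_group_bytes(128 * 1024 * 1024)  // new limit in bytes (this PR)
      ->build();
}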

@wecharyu wecharyu requested a review from wgtmac as a code owner December 12, 2025 07:56
@github-actions (bot):

⚠️ GitHub issue #48467 has been automatically assigned in GitHub to PR creator.

@tusharbhatt7:

Thanks for working on this! Since I'm still new to the Arrow codebase, I reviewed the PR at a high level and it helped me understand how WriterProperties and row group configuration are implemented. I don’t have enough experience yet to provide a full technical review, but the approach looks consistent with the design discussed in the issue.

Thanks again for sharing this!

@HuaHuaY (Contributor) left a comment:

LGTM

return contents_->total_compressed_bytes_written();
}

int64_t RowGroupWriter::current_buffered_bytes() const {
Member:

The function name is a little misleading because readers may think it is the same as contents_->estimated_buffered_value_bytes().

Author:

rename to total_buffered_bytes()

chunk_size = this->properties().max_row_group_length();
}
// max_row_group_bytes is applied only after the row group has accumulated data.
if (row_group_writer_ != nullptr && row_group_writer_->num_rows() > 0) {
Member:

row_group_writer_->num_rows() > 0 can only happen when the current row group writer is in buffered mode. Usually users calling WriteTable never use buffered mode, so this approach does not seem to work in the majority of cases.

Member:

Instead, can we gather this information from all written row groups (if available)?

Author:

@wgtmac If users use the static WriteTable function, the Arrow FileWriter is always recreated and we cannot gather the previously written row groups.

Status WriteTable(const ::arrow::Table& table, ::arrow::MemoryPool* pool,
                  std::shared_ptr<::arrow::io::OutputStream> sink, int64_t chunk_size,
                  std::shared_ptr<WriterProperties> properties,
                  std::shared_ptr<ArrowWriterProperties> arrow_properties) {
  std::unique_ptr<FileWriter> writer;
  ARROW_ASSIGN_OR_RAISE(
      writer, FileWriter::Open(*table.schema(), pool, std::move(sink),
                               std::move(properties), std::move(arrow_properties)));
  RETURN_NOT_OK(writer->WriteTable(table, chunk_size));
  return writer->Close();
}

If users use the internal WriteTable function, we can get avg_row_bytes from the last row_group_writer_ or by gathering all previous row group writers.

Status WriteTable(const Table& table, int64_t chunk_size) override {

Member:

I still think this estimation does not help because in most cases WriteTable will not be used in buffered mode. See my suggestion in the comment below.

@wecharyu (Author):

@wgtmac could you please take a look again?

@wgtmac (Member) commented Jan 7, 2026

Sorry for the delay! I will review this later this week.

  if (auto avg_row_size = EstimateCompressedBytesPerRow()) {
    chunk_size = std::min(
        chunk_size, static_cast<int64_t>(this->properties().max_row_group_bytes() /
                                         avg_row_size.value()));
Author:

The chunk_size could be 0 if the configured max_row_group_bytes is less than avg_row_size; do we need an extra check here?

Member:

We need to clamp the chunk size between 1 and max_row_group_bytes/avg_row_size.
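
For illustration, a minimal sketch of that clamping, reusing the names from the hunk above (not the exact code in the PR):

// Sketch only: keep the row budget within [1, max_row_group_bytes / avg_row_size]
// so a tiny max_row_group_bytes never produces a chunk_size of 0.
if (auto avg_row_size = EstimateCompressedBytesPerRow()) {
  const int64_t rows_by_bytes = static_cast<int64_t>(
      this->properties().max_row_group_bytes() / avg_row_size.value());
  chunk_size = std::min(chunk_size, std::max<int64_t>(1, rows_by_bytes));
}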

  int64_t buffered_bytes = row_group_writer_->EstimatedTotalCompressedBytes();
  batch_size = std::min(
      batch_size, static_cast<int64_t>((max_row_group_bytes - buffered_bytes) /
                                       avg_row_size.value()));
Author:

ditto.

Comment on lines 153 to 159
virtual int64_t num_rows() const = 0;
virtual int64_t compressed_bytes() const = 0;
Member:

Suggested change:
- virtual int64_t num_rows() const = 0;
- virtual int64_t compressed_bytes() const = 0;
+ virtual int64_t compressed_bytes() const = 0;
+ virtual int64_t num_rows() const = 0;

This order looks more natural :)

void AddKeyValueMetadata(
const std::shared_ptr<const KeyValueMetadata>& key_value_metadata);

/// Estimate compressed bytes per row from closed row groups.
Member:

Suggested change:
- /// Estimate compressed bytes per row from closed row groups.
+ /// \brief Estimate compressed bytes per row from closed row groups.
+ /// \return Estimated bytes or std::nullopt when no written row group.

const std::shared_ptr<WriterProperties> properties_;
int num_row_groups_;
int64_t num_rows_;
int64_t compressed_bytes_;
Member:

Perhaps rename it to written_row_group_compressed_bytes_ to be clearer? Or written_compressed_bytes_ if the previous one is too long.

Comment on lines 142 to 143
/// \brief Estimate compressed bytes per row from closed row groups or the active row
/// group.
Member:

Suggested change:
- /// \brief Estimate compressed bytes per row from closed row groups or the active row
- /// group.
+ /// \brief Estimate compressed bytes per row from data written so far.
+ /// \note std::nullopt will be returned if there is no row written.

if (chunk_size <= 0 && table.num_rows() > 0) {
return Status::Invalid("chunk size per row_group must be greater than 0");
} else if (!table.schema()->Equals(*schema_, false)) {
return Status::Invalid("rows per row_group must be greater than 0");
Member:

Suggested change:
- return Status::Invalid("rows per row_group must be greater than 0");
+ return Status::Invalid("chunk size per row_group must be greater than 0");

RETURN_NOT_OK(WriteBatch(offset, batch_size));
offset += batch_size;
} else if (offset < batch.num_rows()) {
// Current row group is full, write remaining rows in a new group.
Member:

Will it cause an infinite loop at this line if batch_size is always 0?

Author:

It would cause an infinite loop only when max_row_group_bytes / avg_row_size is 0; is it OK to return an Invalid status from WriteXxx() in this case?

        if (batch_size == 0 && row_group_writer_->num_rows() == 0) {
          return Status::Invalid(
              "Configured max_row_group_bytes is too small to hold a single row");
        }

Member:

We cannot accept an infinite loop, so perhaps we have to set the minimum batch size to 1 in this case?

Author:

Setting the minimum batch size to 1 is not reasonable: when buffered_bytes > max_row_group_bytes we would still set the batch size to 1, and then we would keep appending one row to the active row group and never create a new one. Returning an invalid status might be more intuitive.

Member:

Shouldn't we check the row group size after writing each batch? If a large per-row size leads to a batch size of 1, we just end up checking the row group size after writing every row.

Author:

We do not check the row group size after writing a batch; the current write logic is:

  1. Check rows and bytes to determine the current batch_size.
  2. If batch_size > 0, write these rows to the current row group; this is guaranteed not to exceed the row group limits.
  3. If batch_size == 0 and there are still rows to write, start a new row group.
  4. Loop back to step 1.

This way we don't need to check the size after writing, since it is guaranteed in step 1; and we will not leave a possibly empty row group after the final batch, which is guaranteed by step 3.

Member:

How do you achieve step 4 above (next loop to 1)? This looks a bit complicated to me. How about the logic below:

For each loop iteration:

  1. Check the row group threshold (both rows and bytes) and append a new row group if needed.
  2. Compute a new batch_size based on different conditions and set its minimum to 1 so we don't get empty batch.
  3. Write the batch as before.
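
For illustration, a minimal sketch of that restructured loop, reusing the names discussed in this thread (row_group_writer_, NewBufferedRowGroup(), EstimateCompressedBytesPerRow(), WriteBatch()); this is a sketch of the suggestion, not the code in the PR:

// Sketch only: check the thresholds first, then compute a batch size with a floor of 1.
while (offset < batch.num_rows()) {
  // 1. Start a new row group if either threshold has already been reached.
  if (row_group_writer_->num_rows() >= max_row_group_length ||
      row_group_writer_->EstimatedTotalCompressedBytes() >= max_row_group_bytes) {
    RETURN_NOT_OK(NewBufferedRowGroup());
  }
  // 2. Compute the batch size from the remaining row/byte budget, never below 1.
  int64_t batch_size = std::min(max_row_group_length - row_group_writer_->num_rows(),
                                batch.num_rows() - offset);
  if (auto avg_row_size = EstimateCompressedBytesPerRow()) {
    const int64_t bytes_left =
        max_row_group_bytes - row_group_writer_->EstimatedTotalCompressedBytes();
    batch_size = std::min(batch_size,
                          static_cast<int64_t>(bytes_left / avg_row_size.value()));
  }
  batch_size = std::max<int64_t>(1, batch_size);
  // 3. Write the batch as before.
  RETURN_NOT_OK(WriteBatch(offset, batch_size));
  offset += batch_size;
}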

@wgtmac (Member) commented Jan 15, 2026

Do you want to take a look at this PR? It may affect the default behavior of row group size. @pitrou

@github-actions bot added the awaiting committer review label and removed the awaiting review label on Jan 15, 2026
int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
  return contents_->total_compressed_bytes() +
         contents_->total_compressed_bytes_written() +
         contents_->EstimatedBufferedValueBytes();
Member:

EstimatedBufferedValueBytes does not account for compression and may therefore wildly overestimate the final compressed size?

Are we sure we want to account for contents not serialized into a page yet?

Author:

This encoding size is only a reference before the first page is written, and its impact diminishes as more pages are written.

Member:

I'm not sure that makes it useful in any way, though.

Member:

In many common cases, the compression ratio is close to 3:1. So I used something like total_compressed_bytes + total_compressed_bytes_written + EstimatedBufferedValueBytes / (codec_type != NONE ? 3 : 1) as an empirical value in the past.
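
For illustration, that heuristic written out as a standalone function; the 3:1 ratio is the empirical value mentioned above, and the parameter names only mirror the accessors in this thread:

#include <cstdint>

// Sketch only: assume buffered (not yet compressed) values shrink roughly 3:1
// once a codec is enabled; pass them through unchanged otherwise.
int64_t EstimateRowGroupBytes(int64_t total_compressed_bytes,
                              int64_t total_compressed_bytes_written,
                              int64_t estimated_buffered_value_bytes,
                              bool compression_enabled) {
  const int64_t assumed_ratio = compression_enabled ? 3 : 1;
  return total_compressed_bytes + total_compressed_bytes_written +
         estimated_buffered_value_bytes / assumed_ratio;
}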

Author:

Do we need this logic here? We could do it in Contents::EstimatedBufferedValueBytes() and rename it to something like EstimatedCompressedBufferedBytes().

Author:

@pitrou Using 3 will also not cause huge over-estimation as long as there are many pages. I think the only difference is: using 1 will produce row groups that exceed max_row_group_bytes, while using 3 produces smaller ones.

Member:

The main difference is that 3 is more complicated and more fragile. So I would rather we do 1.

Author:

If we choose 1, we can change to estimating the batch size based on the average row size and the number of written rows, to avoid ignoring too many buffered bytes, like:

    while (offset < batch.num_rows()) {
      auto avg_row_size = EstimateCompressedBytesPerRow();
      int64_t max_rows =
          avg_row_size
              ? std::min(max_row_group_length,
                         // Ensure batch_size is at least 1 to avoid infinite loops.
                         std::max<int64_t>(1, static_cast<int64_t>(max_row_group_bytes /
                                                                   avg_row_size.value())))
              : max_row_group_length;
      if (row_group_writer_->num_rows() >= max_rows) {
        // Current row group is full, start a new one.
        RETURN_NOT_OK(NewBufferedRowGroup());
      }
      int64_t batch_size =
          std::min(max_rows - row_group_writer_->num_rows(), batch.num_rows() - offset);
      RETURN_NOT_OK(WriteBatch(offset, batch_size));
      offset += batch_size;
    }

The last concern is that max_row_group_bytes cannot take effect before the first page is written; is that acceptable?

Member:

I agree that in most common cases buffered values contribute only a small fraction of the total row group size. There is a caveat in this approach: dictionary entries are buffered and thus not counted here, so if we have many columns we may underestimate a lot because dictionary encoding is enabled by default.

If it is too hard to decide, how about providing a config to let users choose which to use: written pages only, or written pages plus buffered values?

If we choose 1, we can change to estimate batch size based on avg row size and written row numbers to avoid ignoring too many buffered bytes.

This approach also has an obvious caveat, since data pages usually do not share the same row boundaries.
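
For illustration, one possible shape for such a switch; every name below is hypothetical and only sketches the idea of letting users pick the estimation source:

#include <cstdint>

// Hypothetical sketch, not part of the PR: let users choose which bytes count
// toward the max_row_group_bytes check.
enum class RowGroupSizeEstimation {
  kWrittenPagesOnly,         // only pages already flushed by the page writer
  kWrittenPagesPlusBuffered  // also include an estimate of buffered values
};

struct RowGroupSizeOptions {
  int64_t max_row_group_bytes = 128 * 1024 * 1024;
  RowGroupSizeEstimation estimation = RowGroupSizeEstimation::kWrittenPagesOnly;
};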

Author:

I prefer 3 too; this is also the approach taken by parquet-java. And it makes the final row group size smaller than max_row_group_bytes, which is intuitive.

static constexpr int64_t DEFAULT_DICTIONARY_PAGE_SIZE_LIMIT = kDefaultDataPageSize;
static constexpr int64_t DEFAULT_WRITE_BATCH_SIZE = 1024;
static constexpr int64_t DEFAULT_MAX_ROW_GROUP_LENGTH = 1024 * 1024;
static constexpr int64_t DEFAULT_MAX_ROW_GROUP_BYTES = 128 * 1024 * 1024;
Member:

Is there a particular reason for this value? AFAIK some Parquet implementation (is it Parquet Rust? @alamb ) writes a single row group per file by default.

I also feel like the HDFS-related reasons in the Parquet docs are completely outdated (who cares about HDFS?).

Member:

I think smaller row groups are still useful when pruning is essential. https://www.firebolt.io/blog/unlocking-faster-iceberg-queries-the-writer-optimizations-you-are-missing is a good read.

Member:

Right, but the value is not easy to devise. For example, if you have 10_000 columns, this will make for some very short columns.

Member:

In this case, we could set a very large value as default to keep the current behavior. Some engines are smart enough to derive a good threshold (from history or whatever source).

@alamb (Contributor), Jan 16, 2026:

Is there a particular reason for this value? AFAIK some Parquet implementation (is it Parquet Rust? @alamb ) writes a single row group per file by default.

The default row group size in the rust writer is 1M rows (1024*1024) -- NOT bytes

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_max_row_group_size

I looked through and didn't find any setting for max row group size in bytes.

I believe at least at some point in the past, the DuckDB Parquet writer wrote a single large row group -- I am not sure if that is the current behavior or not

Member:

I think Apache Impala also produces a single row group in its Parquet writer. However, I think limiting the row group size in bytes is still useful in some cases. For example, there is a table property in Iceberg: https://github.com/apache/iceberg/blob/73a26fc1f49e6749656a273b2e4d78eb9e64f19e/docs/docs/configuration.md?plain=1#L46. As iceberg-cpp depends on the Parquet writer here, it is nice to support this feature.

Comment on lines 153 to 159
virtual int64_t num_rows() const = 0;
virtual int64_t written_compressed_bytes() const = 0;
Member:

Suggested change:
- virtual int64_t num_rows() const = 0;
- virtual int64_t written_compressed_bytes() const = 0;
+ virtual int64_t written_compressed_bytes() const = 0;
+ virtual int64_t num_rows() const = 0;

/// \brief Estimate compressed bytes per row from closed row groups.
/// \return Estimated bytes or std::nullopt when no written row group.
std::optional<double> EstimateCompressedBytesPerRow() const;
Member:

Unrelated to this PR: perhaps it is also useful to provide an estimate of the current file size to help downstream projects implement a rolling file writer.
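
For illustration, a hypothetical sketch of that rolling-writer idea; SizeEstimatingWriter and EstimatedFileBytes() are invented here to show the shape and are not Arrow APIs:

#include <cstdint>

// Hypothetical sketch: a downstream rolling writer only needs some estimate of
// the bytes accumulated in the current file to decide when to roll over.
class SizeEstimatingWriter {
 public:
  virtual ~SizeEstimatingWriter() = default;
  virtual int64_t EstimatedFileBytes() const = 0;  // written + buffered estimate
};

inline bool ShouldRollFile(const SizeEstimatingWriter& writer,
                           int64_t target_file_bytes) {
  return writer.EstimatedFileBytes() >= target_file_bytes;
}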

@alamb alamb changed the title GH-48467: [C++][Parquet] Add configure to limit the row group size GH-48467: [C++][Parquet] Add configure to limit the row group size in bytes Jan 16, 2026
@alamb (Contributor) commented Jan 16, 2026

I also updated the title of this PR to make it clear it was adding a row group size limit in bytes

virtual int64_t total_compressed_bytes() const = 0;
/// \brief total compressed bytes written by the page writer
virtual int64_t total_compressed_bytes_written() const = 0;
/// \brief estimated bytes of values that are buffered by the page writer
Member:

Suggested change:
- /// \brief estimated bytes of values that are buffered by the page writer
+ /// \brief Estimated bytes of values that are buffered by the page writer

int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
  return contents_->total_compressed_bytes() +
         contents_->total_compressed_bytes_written() +
         contents_->EstimatedBufferedValueBytes();
Member:

contents_->EstimatedBufferedValueBytes() may under-estimate the size because it only accounts for the values buffered by current_encoder_, and ignores buffered values in the repetition_levels_rle_, definition_levels_rle_ and current_dict_encoder_. For deeply nested types, the level values may be significantly larger than the values.

Author:

Nice catch; the buffered levels size is available from definition_levels_sink_ and repetition_levels_sink_.

} else if (chunk_size > this->properties().max_row_group_length()) {
chunk_size = this->properties().max_row_group_length();
}
if (auto avg_row_size = EstimateCompressedBytesPerRow()) {
Member:

We need to check if avg_row_size is 0 or NaN to skip the calculation below. Perhaps we can check this in EstimateCompressedBytesPerRow() to return std::nullopt.

Author:

EstimateCompressedBytesPerRow() returns a calculated value only when num_rows > 0, so it is never 0.
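
For illustration, a sketch of that guard as a standalone function; the member names used in the PR may differ:

#include <cstdint>
#include <optional>

// Sketch only: no estimate is produced until at least one row has been written,
// matching the reply above that the value is only computed when num_rows > 0.
std::optional<double> EstimateCompressedBytesPerRow(int64_t num_rows,
                                                    int64_t written_compressed_bytes) {
  if (num_rows <= 0) {
    return std::nullopt;
  }
  return static_cast<double>(written_compressed_bytes) / static_cast<double>(num_rows);
}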
